Improved Speech Model and Analysis, Synthesis, and Quantization Methods



3 Background 

4 The invention relates to an improved model of speech or acoustic signals and methods for 

5 estimating the improved model parameters and synthesizing signals from these parameters. 

6 Speech models together with speech analysis and synthesis methods are widely used in 

7 applications such as telecommunications, speech recognition, speaker identification, and speech

8 synthesis. Vocoders are a class of speech analysis/synthesis systems based on an underlying model 

9 of speech. Vocoders have been extensively used in practice. Examples of vocoders include linear 

10 prediction vocoders, homomorphic vocoders, channel vocoders, sinusoidal transform coders (STC), 
11 multiband excitation (MBE) vocoders, improved multiband excitation (IMBE™), and advanced

12 multiband excitation vocoders (AMBE™). 

13 Vocoders typically model speech over a short interval of time as the response of a system 

14 excited by some form of excitation. Typically, an input signal s_0(n) is obtained by sampling an

15 analog input signal. For applications such as speech coding or speech recognition, the sampling 

16 rate ranges typically between 6 kHz and 16 kHz. The method works well for any sampling rate 

17 with corresponding changes in the associated parameters. To focus on a short interval centered at

18 time t, the input signal s_0(n) is typically multiplied by a window w(t, n) centered at time t to

19 obtain a windowed signal s(t, n). The window used is typically a Hamming window or Kaiser

20 window and can be constant as a function of t so that w(t, n) = w_0(n - t) or can have

21 characteristics which change as a function of t. The length of the window w(t, n) typically ranges

22 between 5 ms and 40 ms. The windowed signal s(t, n) is typically computed at center times of

23 t_0, t_1, ..., t_m, t_{m+1}, .... Typically, the interval between consecutive center times t_{m+1} - t_m

24 approximates the effective length of the window w(t, n) used for these center times. The

25 windowed signal s(t, n) for a particular center time is often referred to as a segment or frame of

26 the input signal. 
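
The segmentation described above can be illustrated with a minimal Python sketch. The function names, the 20 ms window, and the choice of a hop equal to the window length are assumptions made for illustration only and are not taken from the specification.

```python
import math

def hamming(length):
    """Symmetric Hamming window w_0(n) with `length` samples."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def frames(signal, window, hop):
    """Return windowed segments s(t, n), one every `hop` samples.

    The window is constant as a function of t, i.e. w(t, n) = w_0(n - t).
    """
    length = len(window)
    segments = []
    for start in range(0, len(signal) - length + 1, hop):
        segments.append([signal[start + n] * window[n] for n in range(length)])
    return segments

# Assumed example: 8 kHz sampling, 20 ms window (160 samples), 20 ms hop.
fs = 8000
win = hamming(160)
sig = [math.sin(2.0 * math.pi * 200.0 * n / fs) for n in range(800)]  # 100 ms tone
segs = frames(sig, win, hop=160)
```

A shorter hop gives overlapping frames; the hop used here is only one possible choice.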

27 For each segment of the input signal, system parameters and excitation parameters are 

28 determined. The system parameters typically consist of the spectral envelope or the impulse 

29 response of the system. The excitation parameters typically consist of a fundamental frequency 

30 (or pitch period) and a voiced/unvoiced (V/UV) parameter which indicates whether the input 

31 signal has pitch (or indicates the degree to which the input signal has pitch). For vocoders such as 
1 MBE, IMBE, and AMBE, the input signal is divided into frequency bands and the excitation

2 parameters may also include a V/UV decision for each frequency band. High quality speech 

3 reproduction may be provided using a high quality speech model, an accurate estimation of the 

4 speech model parameters, and high quality synthesis methods. 

5 When the voiced/unvoiced information consists of a single voiced/unvoiced decision for the

6 entire frequency band, the synthesized speech tends to have a "buzzy" quality especially 

7 noticeable in regions of speech which contain mixed voicing or in voiced regions of noisy speech. A 

8 number of mixed excitation models have been proposed as potential solutions to the problem of

9 "buzziness" in vocoders. In these models, periodic and noise-like excitations which have either

10 time-invariant or time-varying spectral shapes are mixed. 

11 In excitation models having time-invariant spectral shapes, the excitation signal consists of

12 the sum of a periodic source and a noise source with fixed spectral envelopes. The mixture ratio 

13 controls the relative amplitudes of the periodic and noise sources. Examples of such models are 

14 described by Itakura and Saito, "Analysis Synthesis Telephony Based upon the Maximum 

15 Likelihood Method," Reports of 6th Int. Cong. Acoust., Tokyo, Japan, Paper C-5-5, pp. C17-20,

16 1968; and Kwon and Goldberg, "An Enhanced LPC Vocoder with No Voiced/Unvoiced Switch,"

17 IEEE Trans. on Acoust., Speech, and Signal Processing, vol. ASSP-32, no. 4, pp. 851-858, August

18 1984. In these excitation models, a white noise source is added to a white periodic source. The 

19 mixture ratio between these sources is estimated from the height of the peak of the 

20 autocorrelation of the LPC residual. 

21 In excitation models having time-varying spectral shapes, the excitation signal consists of 

22 the sum of a periodic source and a noise source with time varying spectral envelope shapes. 

23 Examples of such models are described by Fujimara, "An Approximation to Voice Aperiodicity,"

24 IEEE Trans. Audio and Electroacoust., pp. 68-72, March 1968; Makhoul et al., "A Mixed-Source

25 Excitation Model for Speech Compression and Synthesis," IEEE Int. Conf. on Acoust. Sp. & Sig.

26 Proc., April 1978, pp. 163-166; Kwon and Goldberg, "An Enhanced LPC Vocoder with No

27 Voiced/Unvoiced Switch," IEEE Trans. on Acoust., Speech, and Signal Processing, vol. ASSP-32,

28 no. 4, pp. 851-858, August 1984; and Griffin and Lim, "Multiband Excitation Vocoder," IEEE

29 Trans. Acoust., Speech, Signal Processing, vol. ASSP-36, pp. 1223-1235, Aug. 1988.

30 In the excitation model proposed by Fujimara, the excitation spectrum is divided into 

31 three fixed frequency bands. A separate cepstral analysis is performed for each frequency band 

32 and a voiced/unvoiced decision for each frequency band is made based on the height of the 



1 cepstrum peak as a measure of periodicity. 

2 In the excitation model proposed by Makhoul et al., the excitation signal consists of the 

3 sum of a low-pass periodic source and a high-pass noise source. The low-pass periodic source is 

4 generated by filtering a white pulse source with a variable cut-off low-pass filter. Similarly, the 

5 high-pass noise source is generated by filtering a white noise source with a variable cut-off

6 high-pass filter. The cut-off frequencies for the two filters are equal and are estimated by choosing 

7 the highest frequency at which the spectrum is periodic. Periodicity of the spectrum is determined 

8 by examining the separation between consecutive peaks and determining whether the separations 

9 are the same, within some tolerance level. 

10 In a second excitation model implemented by Kwon and Goldberg, a pulse source is passed 

11 through a variable gain low-pass filter and added to itself, and a white noise source is passed 

12 through a variable gain high-pass filter and added to itself. The excitation signal is the sum of the 

13 resultant pulse and noise sources with the relative amplitudes controlled by a voiced/unvoiced
14 mixture ratio. The filter gains and voiced/unvoiced mixture ratio are estimated from the LPC
15 residual signal with the constraint that the spectral envelope of the resultant excitation signal is
16 flat.

17 In the multiband excitation model proposed by Griffin and Lim, a frequency dependent

18 voiced/unvoiced mixture function is proposed. This model is restricted to a frequency dependent

19 binary voiced/unvoiced decision for coding purposes. A further restriction of this model divides
20 the spectrum into a finite number of frequency bands with a binary voiced/unvoiced decision for
21 each band. The voiced/unvoiced information is estimated by comparing the speech spectrum to
22 the closest periodic spectrum. When the error is below a threshold, the band is marked voiced;

23 otherwise, the band is marked unvoiced. 

24 The Fourier transform of the windowed signal s(t, n) will be denoted by S(t, ω) and will be

25 referred to as the signal Short-Time Fourier Transform (STFT). Suppose s_0(n) is a periodic signal

26 with a fundamental frequency ω_0 or pitch period n_0. The parameters ω_0 and n_0 are related to

27 each other by 2π/ω_0 = n_0. Non-integer values of the pitch period n_0 are often used in practice.

28 A speech signal s_0(n) can be divided into multiple frequency bands using bandpass filters.

29 Characteristics of these bandpass filters are allowed to change as a function of time and/or

30 frequency. A speech signal can also be divided into multiple bands by applying frequency windows

31 or weightings to the speech signal STFT S(t, ω).
Summary



2 In one aspect, generally, methods for synthesizing high quality speech use an improved 

3 speech model. The improved speech model is augmented beyond the time and frequency 

4 dependent voiced/unvoiced mixture function of the multiband excitation model to allow a mixture 

5 of three different signals. In addition to parameters which control the proportion of quasi-periodic 

6 and noise-like signals in each frequency band, a parameter is added to control the proportion of 

7 pulse-like signals in each frequency band. In addition to the typical fundamental frequency 

8 parameter of the voiced excitation, additional parameters are included which control one or more 

9 pulse amplitudes and positions for the pulsed excitation. This model allows additional features of 

10 speech and audio signals important for high quality reproduction to be efficiently modeled.

11 In another aspect, generally, analysis methods are provided for estimating the improved 

12 speech model parameters. For pulsed parameter estimation, an error criterion with reduced 

13 sensitivity to time shifts is used to reduce computation and improve performance. Pulsed 

14 parameter estimation performance is further improved using the estimated voiced strength 

15 parameter to reduce the weighting of frequency bands which are strongly voiced when estimating 

16 the pulsed parameters. 

17 In another aspect, generally, methods for quantizing the improved speech model 

18 parameters are provided. The voiced, unvoiced, and pulsed strength parameters are quantized 

19 using a weighted vector quantization method using a novel error criterion for obtaining high 

20 quality quantization. The fundamental frequency and pulse position parameters are efficiently 

21 quantized based on the quantized strength parameters. 

22 In one general aspect, a method of analyzing a digitized signal to determine model 

23 parameters for the digitized signal is provided. The method includes receiving a digitized signal, 

24 determining a voiced strength for the digitized signal by evaluating a first function, and 

25 determining a pulsed strength for the digitized signal by evaluating a second function. The voiced 

26 strength and the pulsed strength may be determined, for example, at regular intervals of time. In 

27 some implementations, the voiced strength and the pulsed strength may be determined on one or 

28 more frequency bands. In addition, the same function may be used as both the first function and 

29 the second function. 

30 The voiced strength and the pulsed strength may be used to encode the digitized signal. In 

31 some implementations, the pulsed strength may be determined using a pulse signal estimated from

32 the digitized signal. The voiced strength may also be used in determining pulsed strength. 



1 Additionally, the pulsed signal may be determined by combining a transform magnitude with a 

2 transform phase computed from a transform magnitude. The transform phase may be near 

3 minimum phase. In some implementations, the pulsed strength may be determined using a pulsed 

4 signal estimated from a pulse signal and at least one pulse position. 

5 The pulsed strength may be determined by comparing a pulsed signal with the digitized 

6 signal. The comparison may be made using an error criterion with reduced sensitivity to time 

7 shifts. The error criterion may compute phase differences between frequency samples and may 

8 remove the effect of constant phase differences. Additional implementations of the method of

9 analyzing a digitized signal further include quantizing the pulsed strength using a weighted vector 

10 quantization, and quantizing the voiced strength using weighted vector quantization. The voiced 

11 strength and the pulsed strength may be used to estimate one or more model parameters. 

12 Implementations may also include determining the unvoiced strength. 

13 In another general aspect, a method of synthesizing a signal is provided including 

14 determining a voiced signal, determining a voiced strength, determining a pulsed signal, 

15 determining a pulsed strength, dividing the voiced signal and the pulsed signal into two or more 

16 frequency bands, and combining the voiced signal and the pulsed signal based on the voiced 

17 strength and the pulsed strength. The pulsed signal may be determined by combining a transform 

18 magnitude with a transform phase computed from the transform magnitude. 

19 In another general aspect, a method of synthesizing a signal is provided. The method 

20 includes determining a voiced signal; determining a voiced strength; determining a pulsed signal; 

21 determining a pulsed strength; determining an unvoiced signal; determining an unvoiced strength; 

22 dividing the voiced signal, pulsed signal, and unvoiced signal into two or more frequency bands; 

23 and combining the voiced signal, the pulsed signal, and the unvoiced signal based on the voiced 

24 strength, the pulsed strength, and the unvoiced strength. 

25 In another general aspect, a method of quantizing speech model parameters is provided. 

26 The method includes determining the voiced error between a voiced strength parameter and 

27 quantized voiced strength parameters, determining the pulsed error between a pulsed strength 

28 parameter and quantized pulsed strength parameters, combining the voiced error and the pulsed 

29 error to produce a total error, and selecting the quantized voiced strength and the quantized pulsed

30 strength which produce the smallest total error. 
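
As a minimal Python sketch of the codebook search just described, the following assumes a squared-error measure with optional per-band weights. The function name `quantize_strengths`, the codebook layout, and the error measure are illustrative assumptions; they are not the error criterion of the described implementation.

```python
def quantize_strengths(voiced, pulsed, codebook, weights=None):
    """Return (index, total_error) of the best codebook entry.

    `codebook` is a list of (quantized_voiced, quantized_pulsed) pairs,
    each a list with one strength per frequency band (assumed layout).
    """
    if weights is None:
        weights = [1.0] * len(voiced)
    best_index, best_error = -1, float("inf")
    for index, (v_q, p_q) in enumerate(codebook):
        # Voiced error between the strength parameter and this candidate.
        voiced_error = sum(w * (a - b) ** 2
                           for w, a, b in zip(weights, voiced, v_q))
        # Pulsed error between the strength parameter and this candidate.
        pulsed_error = sum(w * (a - b) ** 2
                           for w, a, b in zip(weights, pulsed, p_q))
        total_error = voiced_error + pulsed_error
        if total_error < best_error:
            best_index, best_error = index, total_error
    return best_index, best_error

# Hypothetical two-band codebook with three entries.
codebook = [([1.0, 1.0], [0.0, 0.0]),
            ([0.0, 0.0], [1.0, 1.0]),
            ([1.0, 0.0], [0.0, 1.0])]
index, error = quantize_strengths([0.9, 0.1], [0.1, 0.8], codebook)
```

The exhaustive search is adequate for small codebooks such as the one assumed here.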

31 In another general aspect, a method of quantizing speech model parameters is provided. 

32 The method includes determining a quantized voiced strength and determining a quantized pulsed



1 strength. The method further includes either quantizing a fundamental frequency based on the

2 quantized voiced strength and the quantized pulsed strength or quantizing a pulse position based

3 on the quantized voiced strength and the quantized pulsed strength. The fundamental frequency 

4 may be quantized to a constant when the quantized voiced strength is zero for all frequency bands 

5 and the pulse position may be quantized to a constant when the quantized voiced strength is 

6 nonzero in any frequency band. 

7 The details of one or more implementations are set forth in the accompanying drawings 

8 and the description below. Other features and advantages will be apparent from the description 

9 and drawings, and from the claims. 

10 Brief Description of the Drawings 

11 Fig. 1 is a block diagram of a speech synthesis system using an improved speech model. 

12 Fig. 2 is a block diagram of an analysis system for estimating parameters of the improved 

13 speech model. 

14 Fig. 3 is a block diagram of a pulsed analysis unit that may be used with the analysis 

15 system of Fig. 2. 

16 Fig. 4 is a block diagram of a pulsed analysis with reduced complexity. 

17 Fig. 5 is a block diagram of an excitation parameter quantization system. 

18 Detailed Description 

19 Figs. 1-5 show the structure of a system for speech coding, the various blocks and units of 

20 which may be implemented with software. 

21 Fig. 1 shows a speech synthesis system 10 that uses an improved speech model which 

22 augments the typical excitation parameters with additional parameters for higher quality speech 

23 synthesis. Speech synthesis system 10 includes a voiced synthesis unit 11, an unvoiced synthesis 

24 unit 12, and a pulsed synthesis unit 13. The signals produced by these units are added together 

25 by a summation unit 14. 

26 In addition to parameters which control the proportion of quasi-periodic and noise-like 

27 signals in each frequency band, a parameter is added which controls the proportion of pulse-like 

28 signals in each frequency band. These parameters are functions of time (t) and frequency (ω) and

29 are denoted by V(t, ω) for the quasi-periodic voiced strength, U(t, ω) for the noise-like unvoiced

30 strength, and P(t, ω) for the pulsed signal strength. Typically, the voiced strength parameter



1 V(t, ω) varies between zero, indicating no voiced signal at time t and frequency ω, and one,

2 indicating the signal at time t and frequency ω is entirely voiced. The unvoiced strength and pulsed

3 strength parameters behave in a similar manner. Typically, the strength parameters are

4 constrained so that they sum to one (i.e., V(t, ω) + U(t, ω) + P(t, ω) = 1).

5 The voiced strength parameter V(t, ω) has an associated vector of parameters v(t, ω) which

6 contains voiced excitation parameters and voiced system parameters. The voiced excitation

7 parameters can include a time and frequency dependent fundamental frequency ω_0(t, ω) (or

8 equivalently a pitch period n_0(t, ω)). In this implementation, the unvoiced strength parameter

9 U(t, ω) has an associated vector of parameters u(t, ω) which contains unvoiced excitation

10 parameters and unvoiced system parameters. The unvoiced excitation parameters may include, for

11 example, statistics and energy distribution. Similarly, the pulsed excitation strength parameter

12 P(t, ω) has an associated vector of parameters p(t, ω) containing pulsed excitation parameters and

13 pulsed system parameters. The pulsed excitation parameters may include one or more pulse

14 positions t_0(t, ω) and amplitudes.

15 The voiced parameters V(t, ω) and v(t, ω) control voiced synthesis unit 11. Voiced

16 synthesis unit 11 synthesizes the quasi-periodic voiced signal using one of several known methods 

17 for synthesizing voiced signals. One method for synthesizing voiced signals is disclosed in U.S. 

18 Pat. No. 5,195,166, titled "Methods for Generating the Voiced Portion of Speech Signals," which 

19 is incorporated by reference. Another method is that used by the MBE vocoder which sums the 

20 outputs of sinusoidal oscillators with amplitudes, frequencies, and phases that are interpolated 

21 from one frame to the next to prevent discontinuities. The frequencies of these oscillators are set 

22 to the harmonics of the fundamental (except for small deviations due to interpolation). In one 

23 implementation, the system parameters are samples of the spectral envelope estimated as 

24 disclosed in U.S. Pat. No. 5,754,974, titled "Spectral Magnitude Representation for Multi-Band 

25 Excitation Speech Coders," which is incorporated by reference. The amplitudes of the harmonics 

26 are weighted by the voiced strength V(t, ω) as in the MBE vocoder. The system phase may be

27 estimated from the samples of the spectral envelope as disclosed in U.S. Pat. No. 5,701,390, titled 

28 "Synthesis of MBE-Based Coded Speech using Regenerated Phase Information," which is 

29 incorporated by reference. 

30 The unvoiced parameters U(t, ω) and u(t, ω) control unvoiced synthesis unit 12. Unvoiced

31 synthesis unit 12 synthesizes the noise-like unvoiced signal using one of several known methods for

32 synthesizing unvoiced signals. One method is that used by the MBE vocoder which generates
1 samples of white noise. These white noise samples are then transformed into the frequency

2 domain by applying a window and fast Fourier transform (FFT). The white noise transform is 

3 then multiplied by a noise envelope signal to produce a modified noise transform. The noise 

4 envelope signal adjusts the energy around each spectral envelope sample to the desired value. The 

5 unvoiced signal is then synthesized by taking the inverse FFT of the modified noise transform, 

6 applying a synthesis window, and overlap adding the resulting signals from adjacent frames. 
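
The noise-shaping steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the noise envelope is taken to be one nonnegative gain per transform sample, a sine window with 50% overlap is used for both analysis and synthesis, and a direct DFT stands in for the FFT. The name `synthesize_unvoiced` and all parameter values are hypothetical.

```python
import cmath
import math
import random

def dft(x, inverse=False):
    """Direct DFT (or inverse DFT) of a short sequence; stands in for an FFT."""
    size = len(x)
    sign = 1.0 if inverse else -1.0
    scale = 1.0 / size if inverse else 1.0
    return [scale * sum(x[n] * cmath.exp(sign * 2j * math.pi * k * n / size)
                        for n in range(size))
            for k in range(size)]

def synthesize_unvoiced(envelopes, frame_len=32, seed=0):
    """Shape white noise by one gain list per frame and overlap-add the frames."""
    rng = random.Random(seed)
    hop = frame_len // 2
    window = [math.sin(math.pi * (n + 0.5) / frame_len) for n in range(frame_len)]
    out = [0.0] * (hop * (len(envelopes) + 1))
    for i, envelope in enumerate(envelopes):
        # Windowed white noise, transformed to the frequency domain.
        noise = [rng.gauss(0.0, 1.0) * w for w in window]
        spectrum = dft(noise)
        # Multiply by the noise envelope to obtain the modified noise transform.
        shaped = [s * g for s, g in zip(spectrum, envelope)]
        # Inverse transform, apply the synthesis window, and overlap-add.
        frame = [v.real for v in dft(shaped, inverse=True)]
        for n in range(frame_len):
            out[i * hop + n] += frame[n] * window[n]
    return out

flat = synthesize_unvoiced([[1.0] * 32] * 3)    # unshaped noise, three frames
silent = synthesize_unvoiced([[0.0] * 32] * 3)  # zero envelope gives silence
```

A practical envelope would be symmetric about half the transform length so that the shaped frame is real; the sketch simply keeps the real part.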

7 The pulsed parameters P(t, ω) and p(t, ω) control pulsed synthesis unit 13. Pulsed

8 synthesis unit 13 synthesizes the pulsed signal by synthesizing one or more pulses with the

9 positions and amplitudes contained in p(t, ω) to produce a pulsed excitation signal. The pulsed

10 excitation is then passed through a filter generated from the system parameters. The magnitude

11 of the filter as a function of frequency ω is weighted by the pulsed strength P(t, ω). Alternatively,

12 the magnitude of the pulses as a function of frequency can be weighted by the pulsed strength. 

13 The voiced signal, unvoiced signal, and pulsed signal produced by units 11, 12, and 13 are 

14 added together by summation unit 14 to produce the synthesized speech signal. 

15 Fig. 2 shows a speech analysis system 20 that estimates improved model parameters from 

16 an input signal. The speech analysis system 20 includes a sampling unit 21, a voiced analysis unit

17 22, an unvoiced analysis unit 23, and a pulsed analysis unit 24. The sampling unit 21 samples an

18 analog input signal to produce a speech signal s_0(n). It should be noted that sampling unit 21

19 operates remotely from the analysis units in many applications. For typical speech coding or

20 recognition applications, the sampling rate ranges between 6 kHz and 16 kHz.

21 The voiced analysis unit 22 estimates the voiced strength V(t, ω) and the voiced

22 parameters v(t, ω) from the speech signal s_0(n). The unvoiced analysis unit 23 estimates the

23 unvoiced strength U(t, ω) and the unvoiced parameters u(t, ω) from the speech signal s_0(n). The

24 pulsed analysis unit 24 estimates the pulsed strength P(t, ω) and the pulsed signal parameters

25 p(t, ω) from the speech signal s_0(n). The vertical arrows between analysis units 22-24 indicate

26 that information flows between these units to improve parameter estimation performance. 

27 The voiced analysis and unvoiced analysis units can use known methods such as those used 

28 for the estimation of MBE model parameters as disclosed in U.S. Pat. No. 5,715,365, titled 

29 "Estimation of Excitation Parameters" and U.S. Pat. No. 5,826,222, titled "Estimation of 

30 Excitation Parameters," both of which are incorporated by reference. The described 

31 implementation of the pulsed analysis unit uses new methods for estimation of the pulsed 

32 parameters. 
1 Referring to Fig. 3, the pulsed analysis unit 24 includes a window and Fourier transform

2 unit 31, an estimate pulse FT and synthesize pulsed FT unit 32, and a compare unit 33. The

3 pulsed analysis unit 24 estimates the pulsed strength P(t, ω) and the pulsed parameters p(t, ω)

4 from the speech signal s_0(n).

5 The window and Fourier transform unit 31 multiplies the input speech signal s_0(n) by a

6 window w(t, n) centered at time t to obtain a windowed signal s(t, n). The window used is

7 typically a Hamming window or Kaiser window and is typically constant as a function of t so that

8 w(t, n) = w_0(n - t). The length of the window w(t, n) typically ranges between 5 ms and 40 ms.

9 The Fourier transform (FT) of the windowed signal, S(t, ω), is typically computed using a fast

10 Fourier transform (FFT) with a length greater than or equal to the number of samples in the
11 window. When the length of the FFT is greater than the number of windowed samples, the
12 additional samples in the FFT are zeroed.

13 The estimate pulse FT and synthesize pulsed FT unit 32 estimates a pulse from S(t, ω)

14 and then synthesizes a pulsed signal transform Ŝ(t, ω) from the pulse estimate and a set of pulse

15 positions and amplitudes. The synthesized pulsed transform Ŝ(t, ω) is then compared to the

16 speech transform S(t, ω) using compare unit 33. The comparison is performed using an error

17 criterion. The error criterion can be optimized over the pulse positions, amplitudes, and pulse

18 shape. The optimum pulse positions, amplitudes, and pulse shape become the pulsed signal

19 parameters p(t, ω). The error between the speech transform S(t, ω) and the optimum pulsed

20 transform Ŝ(t, ω) is used to compute the pulsed signal strength P(t, ω).

21 A number of techniques exist for estimating the pulse Fourier transform. For example, the 

22 pulse can be modeled as the impulse response of an all-pole filter. The coefficients of the all-pole 

23 filter can be estimated using well known algorithms such as the autocorrelation method or the 

24 covariance method. Once the pulse is estimated, the pulsed Fourier transform can be estimated by 

25 adding copies of the pulse with the positions and amplitudes specified. The pulsed Fourier 

26 transform is then compared to the speech transform using an error criterion such as weighted 

27 squared error. The error criterion is evaluated at all possible pulse positions and amplitudes or

28 some constrained set of positions and amplitudes to determine the best pulse positions, 

29 amplitudes, and pulse FT. 

30 Another technique for estimating the pulse Fourier transform is to estimate a minimum 

31 phase component from the magnitude of the short time Fourier transform (STFT) |S(t, ω)| of the

32 speech. This minimum phase component may be combined with the speech transform magnitude
1 to produce a pulse transform estimate. Other techniques for estimating the pulse Fourier

2 transform include pole-zero models of the pulse and corrections to the minimum phase approach 

3 based on models of the glottal pulse shape. 

4 Some implementations employ an error criterion having reduced sensitivity to time shifts

5 (linear phase shifts in the Fourier transform). This type of error criterion can lead to reduced 

6 computational requirements since the number of time shifts at which the error criterion needs to 

7 be evaluated can be significantly reduced. In addition, reduced sensitivity to linear phase shifts 

8 improves robustness to phase distortions which are slowly changing in frequency. These phase 

9 distortions are due to the transmission medium or deviations of the actual system from the model. 

10 For example, the following equation may be used as an error criterion: 

E(t) = \min_{\theta} \int_{-\pi}^{\pi} G(t,\omega) \left| S(t,\omega) S^{*}(t,\omega-\Delta\omega) - e^{j\theta} \hat{S}(t,\omega) \hat{S}^{*}(t,\omega-\Delta\omega) \right|^{2} d\omega    (1)

11 In Equation (1), S(t, ω) is the speech STFT, Ŝ(t, ω) is the pulsed transform, G(t, ω) is a

12 time and frequency dependent weighting, and θ is a variable used to compensate for linear phase

13 offsets. To see how θ compensates for linear phase offsets, it is useful to consider an example.

14 Suppose the speech transform is exactly matched with the pulsed transform except for a linear

15 phase offset so that Ŝ(t, ω) = e^{-jωt_0} S(t, ω). Substituting this relation into Equation (1) yields

E(t) = \min_{\theta} \int_{-\pi}^{\pi} G(t,\omega) \left| S(t,\omega) S^{*}(t,\omega-\Delta\omega) \left[ 1 - e^{j(\theta - \Delta\omega t_0)} \right] \right|^{2} d\omega    (2)

16 which is minimized over θ at θ_min = Δω t_0. In addition, once θ_min is known, the time shift t_0 can

17 be estimated by

t_0 = \frac{\theta_{\min}}{\Delta\omega}    (3)

18 where Δω is typically chosen to be the frequency interval between adjacent FFT samples.

19 Equation (1) is minimized by choosing θ as follows



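\theta_{\min}(t) = \arctan\left[ \int_{-\pi}^{\pi} G(t,\omega)\, S(t,\omega) S^{*}(t,\omega-\Delta\omega)\, \hat{S}^{*}(t,\omega) \hat{S}(t,\omega-\Delta\omega)\, d\omega \right].    (4)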
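
The choice of θ in Equation (4) and the resulting time shift estimate can be checked numerically with a short Python sketch. It assumes that the arctangent of the complex-valued integral denotes its phase angle, that the integral is replaced by a sum over DFT samples with Δω equal to the sample spacing, and that G(t, ω) = 1 unless a weighting is supplied. The function names and test signal are hypothetical.

```python
import cmath
import math

def dft(x, size):
    """Zero-padded direct DFT of x at `size` frequency samples."""
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / size)
                for n in range(len(x)))
            for k in range(size)]

def theta_min(S, S_hat, G=None):
    """Phase angle of sum_k G[k] S[k] S*[k-1] S_hat*[k] S_hat[k-1].

    Index k-1 wraps around at k = 0 because the DFT is periodic.
    """
    acc = 0j
    for k in range(len(S)):
        g = 1.0 if G is None else G[k]
        acc += (g * S[k] * S[k - 1].conjugate()
                * S_hat[k].conjugate() * S_hat[k - 1])
    return cmath.phase(acc)

def estimate_shift(S, S_hat, G=None):
    """Time shift t_0 = theta_min / delta_omega, with delta_omega = 2*pi/size."""
    delta_omega = 2.0 * math.pi / len(S)
    return theta_min(S, S_hat, G) / delta_omega

# Assumed example: the second signal is the first one delayed by 3 samples.
size = 64
x = [1.0, 0.8, 0.5, 0.2, -0.1, 0.05]
x_delayed = [0.0, 0.0, 0.0] + x
t0 = estimate_shift(dft(x, size), dft(x_delayed, size))
```

Because only a single phase angle is computed, the shift is recovered without evaluating the error at each candidate position; shifts larger than half the transform length would alias.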

20 When computing θ_min(t) using Equation (4), if G(t, ω) = 1, the frequency weighting is

21 approximately |S(t, ω)|^4. This tends to weight frequency regions with higher energy too heavily

22 relative to frequency regions of lower energy. G(t, ω) may be used to adjust the frequency

23 weighting. The following function for G(t, ω) may be used to improve performance in typical

24 applications:
1 where F(t, ω) is a time and frequency weighting function. There are a number of choices for

2 F(t, ω) which are useful in practice. These include F(t, ω) = 1, which is simple to implement and

3 achieves good results for many applications. A better choice for many applications is to make

4 F(t, ω) larger in frequency regions with higher pulse-to-noise ratios and smaller in regions with

5 lower pulse-to-noise ratios. In this case, "noise" refers to non-pulse signals such as quasi-periodic

6 or noise-like signals. In one implementation, the weighting F(t, ω) is reduced in frequency regions

7 where the estimated voiced strength V(t, ω) is high. In particular, if the voiced strength V(t, ω) is
8 high enough that the synthesized signal would consist entirely of a voiced signal at time t and

9 frequency ω, then F(t, ω) would have a value of zero. In addition, F(t, ω) is zeroed out for ω < 400

10 Hz to avoid deviations from minimum phase typically present at low frequencies. Perceptually

11 based error criteria can also be factored into F(t, ω) to improve performance in applications where
12 the synthesized signal is eventually presented to the ear. 

13 After computing θ_min(t), a frequency dependent error E(t, ω) may be defined as:

E(t,\omega) = G(t,\omega) \left| S(t,\omega) S^{*}(t,\omega-\Delta\omega) - e^{j\theta_{\min}(t)} \hat{S}(t,\omega) \hat{S}^{*}(t,\omega-\Delta\omega) \right|^{2}.    (6)

14 The error E(t, ω) is useful for computation of the pulsed signal strength P(t, ω). When computing

15 the error E(t, ω), the weighting function F(t, ω) is typically set to a constant of one. A small
16 value of E(t, ω) indicates similarity between the speech transform S(t, ω) and the pulsed

17 transform Ŝ(t, ω), which indicates a relatively high value of the pulsed signal strength P(t, ω). A

18 large value of E(t, ω) indicates dissimilarity between the speech transform S(t, ω) and the pulsed

19 transform Ŝ(t, ω), which indicates a relatively low value of the pulsed signal strength P(t, ω).

20 Fig. 4 shows a pulsed analysis unit 24 that includes a window and FT unit 41, a synthesize

21 phase unit 42, and a minimize error unit 43. The pulsed analysis unit 24 estimates the pulsed

22 strength P(t, ω) and the pulsed parameters from the speech signal s_0(n) using a reduced

23 complexity implementation. The window and FT unit 41 operates in the same manner as 

24 previously described for unit 31. In this implementation, the number of pulses is reduced to one 

25 per frame in order to reduce computation and the number of parameters. For applications such as 

26 speech coding, reduction of the number of parameters is helpful for reduction of speech coding

27 rates. The synthesize phase unit 42 computes the phase of the pulse Fourier transform using well 

28 known homomorphic vocoder techniques for computing a Fourier transform with minimum phase 
1 from the magnitude of the speech STFT |S(t, ω)|. The magnitude of the pulse Fourier transform

2 is set to |S(t, ω)|. The system parameter output p(t, ω) consists of the pulse Fourier transform.
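
One common homomorphic method for obtaining a minimum phase from a transform magnitude folds the real cepstrum onto its causal part. The Python sketch below illustrates that method under stated assumptions: an even transform length, strictly positive magnitudes (a small floor is applied), and a direct DFT in place of an FFT. It is offered as an illustration and is not asserted to be the computation performed by synthesize phase unit 42.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct DFT (or inverse DFT); stands in for an FFT on short inputs."""
    size = len(x)
    sign = 1.0 if inverse else -1.0
    scale = 1.0 / size if inverse else 1.0
    return [scale * sum(x[n] * cmath.exp(sign * 2j * math.pi * k * n / size)
                        for n in range(size))
            for k in range(size)]

def minimum_phase(magnitude):
    """Return the minimum phase (radians) for each sample of a DFT magnitude."""
    size = len(magnitude)          # assumed even
    half = size // 2
    log_mag = [math.log(max(m, 1e-12)) for m in magnitude]
    # Real cepstrum: inverse transform of the log magnitude.
    cepstrum = [v.real for v in dft(log_mag, inverse=True)]
    # Fold the even cepstrum onto its causal part.
    folded = [0.0] * size
    folded[0] = cepstrum[0]
    for n in range(1, half):
        folded[n] = 2.0 * cepstrum[n]
    folded[half] = cepstrum[half]
    # The imaginary part of the resulting log spectrum is the phase.
    return [v.imag for v in dft(folded)]

# Check on a sequence known to be minimum phase: h(n) = {1, 0.5}.
h = [1.0, 0.5] + [0.0] * 62
H = dft(h)
phase = minimum_phase([abs(v) for v in H])
max_error = max(abs(p - cmath.phase(v)) for p, v in zip(phase, H))
```

Combining the returned phase with the original magnitude gives a pulse transform estimate of the kind described above.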

3 The minimize error unit 43 computes the pulse position t_0 using Equations (3) and (4). For

4 this implementation, the pulse position t_0(t, ω) varies with frame time t but is constant as a

5 function of ω. After computing θ_min, the frequency dependent error E(t, ω) is computed using

6 Equation (6). The normalizing function D(t, ω) is computed using

D(t,\omega) = G(t,\omega) \left| S(t,\omega) S^{*}(t,\omega-\Delta\omega) \right|^{2}    (7)

7 and applied to the computation of the pulsed excitation strength

P(t,\omega) = \begin{cases} 0, & P'(t,\omega) < 0 \\ P'(t,\omega), & 0 \le P'(t,\omega) \le 1 \\ 1, & P'(t,\omega) > 1 \end{cases}    (8)

8 where 

-(-)^^>-(^). 

9 Ē(t, ω) and D̄(t, ω) are frequency smoothed versions of E(t, ω) and D(t, ω), and τ is a threshold

10 typically set to a constant of 0.1. Since Ē(t, ω) and D̄(t, ω) are frequency smoothed (low pass
11 filtered), they can be downsampled in frequency without loss of information. In one

12 implementation, Ē(t, ω) and D̄(t, ω) are computed for eight frequency bands by summing E(t, ω)

13 and D(t, ω) over all ω in a particular frequency band. Typical band edges for these 8 frequency

14 bands for an 8 kHz sampling rate are 0 Hz, 375 Hz, 875 Hz, 1375 Hz, 1875 Hz, 2375 Hz, 2875 Hz,

15 3375 Hz, and 4000 Hz. 

It should be noted that the above frequency domain computations are typically carried out using frequency samples computed using fast Fourier transforms (FFTs). Then, the integrals are computed using summations of these frequency samples.
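As a rough illustration of how FFT samples might be summed into the eight bands and how the limiting of Equation (8) could be applied, consider the sketch below. The names `band_sums` and `clamp_strength` are hypothetical, as is the convention that each band includes its lower edge and excludes its upper edge (except the last band, which includes 4000 Hz). The unlimited strength $P'(t,\omega)$ is taken as an input because its defining equation is not reproduced here.

```python
# Band edges quoted in the text for an 8 kHz sampling rate.
BAND_EDGES_HZ = [0, 375, 875, 1375, 1875, 2375, 2875, 3375, 4000]


def band_sums(samples, sample_rate_hz, fft_size, edges_hz=BAND_EDGES_HZ):
    """Sum FFT-bin samples (bins 0..fft_size/2) into frequency bands.

    Assumption: a band covers [lower edge, upper edge); the final band
    also includes its upper edge so the Nyquist bin is not dropped.
    """
    sums = [0.0] * (len(edges_hz) - 1)
    for k, value in enumerate(samples):
        freq = k * sample_rate_hz / fft_size
        for b in range(len(sums)):
            is_last = b == len(sums) - 1
            in_band = edges_hz[b] <= freq < edges_hz[b + 1]
            if in_band or (is_last and freq == edges_hz[-1]):
                sums[b] += value
                break
    return sums


def clamp_strength(p_raw):
    """Limit an unlimited strength value to the range [0, 1], as in Equation (8)."""
    return min(1.0, max(0.0, p_raw))
```

For example, with a 256-point FFT at 8 kHz the bin spacing is 31.25 Hz, so the first band collects bins 0 through 11 and the last band collects bins 108 through 128.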

Referring to Fig. 5, an excitation parameter quantization system 50 includes a voiced/unvoiced/pulsed (V/U/P) strength quantizer unit 51 and a fundamental and pulse position quantizer unit 52. Excitation parameter quantization system 50 jointly quantizes the voiced strength $V(t,\omega)$, the unvoiced strength $U(t,\omega)$, and the pulsed strength $P(t,\omega)$ to produce the quantized voiced strength $\check{V}(t,\omega)$, the quantized unvoiced strength $\check{U}(t,\omega)$, and the quantized pulsed strength $\check{P}(t,\omega)$ using V/U/P strength quantizer unit 51. Fundamental and pulse position quantizer unit 52 quantizes the fundamental frequency $\omega_0(t,\omega)$ and the pulse position $t_0(t,\omega)$ based on the quantized strength parameters to produce the quantized fundamental frequency $\check{\omega}_0(t,\omega)$ and the quantized pulse position $\check{t}_0(t,\omega)$.

One implementation uses a weighted vector quantizer to jointly quantize the strength parameters from two adjacent frames using 7 bits. The strength parameters are divided into 8 frequency bands. Typical band edges for these 8 frequency bands for an 8 kHz sampling rate are 0 Hz, 375 Hz, 875 Hz, 1375 Hz, 1875 Hz, 2375 Hz, 2875 Hz, 3375 Hz, and 4000 Hz. The codebook for the vector quantizer contains 128 entries consisting of 16 quantized strength parameters for the 8 frequency bands of two adjacent frames. To reduce storage in the codebook, the entries are quantized so that for a particular frequency band a value of zero is used for entirely unvoiced, one is used for entirely voiced, and two is used for entirely pulsed.

For each codebook index $m$ the error is evaluated using

\[ E_m = \sum_{n=0}^{1} \sum_{k=0}^{7} \alpha(t_n,\omega_k)\, E_m(t_n,\omega_k) \tag{10} \]

where

\[ E_m(t_n,\omega_k) = \max\left[\;\ldots\;\right], \tag{11} \]

$\alpha(t_n,\omega_k)$ is a frequency and time dependent weighting typically set to the energy in the speech transform $S(t_n,\omega_k)$ around time $t_n$ and frequency $\omega_k$, $\max(a,b)$ evaluates to the maximum of $a$ or $b$, and $\check{V}_m(t_n,\omega_k)$ and $\check{P}_m(t_n,\omega_k)$ are the quantized voicing strength and quantized pulsed strength. The error $E_m$ of Equation (10) is computed for each codebook index $m$, and the codebook index that minimizes $E_m$ is selected.

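In another preferred embodiment, the error $E_m(t_n,\omega_k)$ of Equation (11) is replaced by

\[ E_m(t_n,\omega_k) = \gamma_m(t_n,\omega_k) + \beta\,\bigl(1-\check{V}_m(t_n,\omega_k)\bigr)\bigl(1-\gamma_m(t_n,\omega_k)\bigr)\bigl(P(t_n,\omega_k)-\check{P}_m(t_n,\omega_k)\bigr)^{2} \tag{12} \]

where

\[ \gamma_m(t_n,\omega_k) = \bigl(V(t_n,\omega_k)-\check{V}_m(t_n,\omega_k)\bigr)^{2} \tag{13} \]

and $\beta$ is typically set to a constant of 0.5.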
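The codebook search with the per-band error of Equations (12) and (13) might look like the following sketch. Several things here are assumptions made for illustration: the function names, the flat 16-element layout (8 bands for each of two frames), and the mapping of the stored codebook values 0, 1, and 2 to quantized (voiced, pulsed) strength pairs of (0, 0), (1, 0), and (0, 1). The real codebook would hold 128 entries; its contents are not given in the text.

```python
def decode_symbol(symbol):
    """Map a stored codebook value to assumed (voiced, pulsed) strengths."""
    return {0: (0.0, 0.0),   # entirely unvoiced
            1: (1.0, 0.0),   # entirely voiced
            2: (0.0, 1.0)}[symbol]  # entirely pulsed


def band_error(v, p, v_q, p_q, beta=0.5):
    """Per-band error following Equations (12) and (13)."""
    gamma = (v - v_q) ** 2
    return gamma + beta * (1.0 - v_q) * (1.0 - gamma) * (p - p_q) ** 2


def search_codebook(codebook, voiced, pulsed, weights, beta=0.5):
    """Return (index, error) of the entry minimizing the weighted error.

    `voiced`, `pulsed`, and `weights` each hold 16 values: 8 bands for
    the first frame followed by 8 bands for the second frame.
    """
    best_index, best_error = None, float("inf")
    for m, entry in enumerate(codebook):
        total = 0.0
        for i, symbol in enumerate(entry):
            v_q, p_q = decode_symbol(symbol)
            total += weights[i] * band_error(voiced[i], pulsed[i], v_q, p_q, beta)
        if total < best_error:
            best_index, best_error = m, total
    return best_index, best_error
```

With a toy three-entry codebook (all unvoiced, all voiced, all pulsed), strongly voiced input selects the all-voiced entry and strongly pulsed input selects the all-pulsed entry. Note that the factor $(1-\check{V}_m)$ removes the pulsed term whenever a band is quantized as entirely voiced.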
If the quantized voiced strength $\check{V}(t,\omega)$ is non-zero at any frequency for the two current frames, then the two fundamental frequencies for these frames are jointly quantized using 9 bits, and the pulse positions are quantized to zero (center of window) using no bits.

If the quantized voiced strength $\check{V}(t,\omega)$ is zero at all frequencies for the two current frames and the quantized pulsed strength $\check{P}(t,\omega)$ is non-zero at any frequency for the current two frames, then the two pulse positions for these frames may be quantized using, for example, 9 bits, and the fundamental frequencies are set to a value of, for example, 64.84 Hz using no bits.

If the quantized voiced strength $\check{V}(t,\omega)$ and the quantized pulsed strength $\check{P}(t,\omega)$ are both zero at all frequencies for the current two frames, then the two pulse positions for these frames are quantized to zero, and the fundamental frequencies for these frames may be jointly quantized using 9 bits.
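The three cases above amount to a small decision rule, sketched below. The function name `allocate_bits` and the dictionary keys are hypothetical. The 9-bit budgets and the 64.84 Hz default are the example values quoted in the text.

```python
def allocate_bits(voiced_q, pulsed_q):
    """Choose how 9 bits are spent for a pair of frames.

    `voiced_q` and `pulsed_q` hold the quantized voiced and pulsed
    strengths for all bands of the two current frames.
    """
    any_voiced = any(v != 0 for v in voiced_q)
    any_pulsed = any(p != 0 for p in pulsed_q)
    if not any_voiced and any_pulsed:
        # Pulsed only: spend the bits on the two pulse positions and
        # fix the fundamental at the example default value.
        return {"fundamental_bits": 0,
                "pulse_position_bits": 9,
                "fixed_fundamental_hz": 64.84}
    # Any voiced band, or neither voiced nor pulsed: quantize the two
    # fundamentals jointly and set the pulse positions to zero.
    return {"fundamental_bits": 9,
            "pulse_position_bits": 0,
            "fixed_fundamental_hz": None}
```

Because the first and third cases lead to the same allocation, the sketch only needs to test for the pulsed-only case explicitly.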

Other implementations are within the following claims.

What is claimed is: